INTERSPEECH 2015 - Speech Synthesis

Total: 58

#1 Phase perception of the glottal excitation of vocoded speech

Authors: Tuomo Raitio ; Lauri Juvela ; Antti Suni ; Martti Vainio ; Paavo Alku

While the characteristics of the amplitude spectrum of the voiced excitation have been studied widely both in natural and synthetic speech, the role of the excitation phase has remained less explored. Especially in speech synthesis, the phase information is often omitted for simplicity. This study investigates the impact of phase information of the excitation signal of voiced speech. The experiments in the study involve analysis-synthesis of speech using a vocoder that utilizes natural glottal flow pulses for reconstructing the voiced excitation. Firstly, the phase spectra of the glottal flow waveforms are converted to either zero-phase or random-phase. Secondly, the quality of vocoded speech using the two phase-modified pulses is compared in subjective listening tests to the corresponding signal excited with the natural-phase pulse. The results indicate that phase has a perceptually relevant effect in vocoded speech and the use of natural phase improves the synthesis quality.
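
As a concrete illustration of the phase manipulation described in the abstract, the sketch below keeps the amplitude spectrum of a single excitation pulse and replaces its phase spectrum with either zero phase or uniformly random phase. The Hanning-window "pulse" is only a placeholder for a natural glottal flow pulse, and the function is a minimal numpy reading of the idea, not the vocoder used in the paper.

```python
import numpy as np

def set_phase(pulse, mode="zero", rng=None):
    """Keep the amplitude spectrum of `pulse` and replace its phase spectrum."""
    rng = rng or np.random.default_rng(0)
    magnitude = np.abs(np.fft.rfft(pulse))
    if mode == "zero":
        phase = np.zeros_like(magnitude)
    elif mode == "random":
        phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
        phase[0] = 0.0                      # keep the DC bin real
        if len(pulse) % 2 == 0:
            phase[-1] = 0.0                 # keep the Nyquist bin real
    else:
        raise ValueError("mode must be 'zero' or 'random'")
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(pulse))

pulse = np.hanning(240)                     # placeholder excitation pulse
zero_phase_pulse = set_phase(pulse, "zero")
random_phase_pulse = set_phase(pulse, "random")
```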

#2 Using acoustics to improve pronunciation for synthesis of low resource languages

Authors: Sunayana Sitaram ; Serena Jeblee ; Alan W. Black

Some languages have very consistent mappings between graphemes and phonemes, while in other languages, this mapping is more ambiguous. Consonantal writing systems prove to be a challenge for Text to Speech Systems (TTS) because they do not indicate short vowels, which creates an ambiguity in pronunciation. Special letter-to-sound rules may be needed for some cases in languages that otherwise have a good correspondence between graphemes and phonemes. In the low-resource scenario, we may not have linguistic resources such as diacritizers or hand-written rules for the language. We propose a technique to automatically learn pronunciations iteratively from acoustics during TTS training and predict pronunciations from text during synthesis time. We conduct experiments on dialects of Arabic for disambiguating homographs and Hindi for discovering the schwa-deletion rules. We evaluate our systems using objective and subjective metrics of TTS and show significant improvements for dialects of Arabic. Our methods can be generalized to other languages that exhibit similar phenomena.

#3 Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum

Authors: Tadashi Inai ; Sunao Hara ; Masanobu Abe ; Yusuke Ijima ; Noboru Miyazaki ; Hideyuki Mizuno

In this paper, we propose a sub-band speech synthesis approach to develop a high-quality Text-to-Speech (TTS) system: a sample-based spectrum is used in the high-frequency band and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum means a spectrum selected from a phoneme database as the one most similar to the spectrum generated by HMM-based speech synthesis. A key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method has better performance than HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, preference test results among the proposed method, HMM-based speech synthesis, and waveform speech synthesis using 80 min of speech data reveal that the proposed method is the most preferred.
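
A rough numpy sketch of the band-splitting idea follows: the statistically generated spectrum is kept below a cutoff frequency and the sample-based spectrum, chosen here as the database frame closest to the generated one, is spliced in above it. The cutoff, sample rate, distance measure and random arrays are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def select_sample_spectrum(hmm_spectrum, database):
    """Pick the database spectrum most similar to the generated one
    (Euclidean distance on log amplitudes; one of many possible measures)."""
    dists = [np.linalg.norm(np.log(s + 1e-10) - np.log(hmm_spectrum + 1e-10))
             for s in database]
    return database[int(np.argmin(dists))]

def combine_subbands(hmm_spectrum, sample_spectrum, sample_rate=16000, cutoff_hz=4000):
    """Use the HMM-generated spectrum in the low band and the sample-based
    spectrum above `cutoff_hz`. Both are amplitude spectra over the same bins."""
    n_bins = len(hmm_spectrum)
    cutoff_bin = int(round(cutoff_hz / (sample_rate / 2) * (n_bins - 1)))
    combined = hmm_spectrum.copy()
    combined[cutoff_bin:] = sample_spectrum[cutoff_bin:]
    return combined

rng = np.random.default_rng(0)
hmm_spec = np.abs(rng.normal(size=513)) + 1e-3            # placeholder generated spectrum
database = [np.abs(rng.normal(size=513)) + 1e-3 for _ in range(100)]
output = combine_subbands(hmm_spec, select_sample_spectrum(hmm_spec, database))
```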

#4 Pruning redundant synthesis units based on static and delta unit appearance frequency

Authors: Heng Lu ; Wei Zhang ; Xu Shao ; Quan Zhou ; Wenhui Lei ; Hongbin Zhou ; Andrew Breen

In order to reduce the footprint of concatenative speech synthesis systems for embedded devices, a novel method for pruning redundant units is introduced in this work. Instead of using only a unit appearance frequency-based pruning criterion, as in the conventional method, the new method introduces the concept of “delta unit appearance frequency”, which indicates whether a unit is replaceable or not. Both static and delta unit appearance frequencies are used as pruning criteria in the proposed method. Only units with comparatively high appearance frequency that cannot be replaced by other units are preserved in the database. Experiments show that the new method can greatly reduce the footprint of our speech synthesis system with little loss of synthesis voice quality.

#5 Emotional transplant in statistical speech synthesis based on emotion additive model

Authors: Yamato Ohtani ; Yu Nasu ; Masahiro Morita ; Masami Akamine

This paper proposes a novel method to transplant emotions to a new speaker in statistical speech synthesis based on an emotion additive model (EAM), which represents the differences between emotional and neutral voices. This method trains an EAM using neutral and emotional speech data of multiple speakers and applies it to a neutral voice model of a new speaker (target). There is some degradation in speech quality due to a mismatch in speakers between the EAM and the target neutral voice model. To alleviate the mismatch, we introduce an eigenvoice technique to this framework. We build neutral voice models and EAMs using multiple speakers, and construct an eigenvoice space consisting of the neutral voice models and EAMs. To transplant the emotion to the target speaker, the proposed method estimates weights of eigenvoices for the target neutral speech data based on a maximum likelihood criterion. The EAM of the target speaker is obtained by applying the estimated weights to the EAM parameters of the eigenvoice space. Emotional speech is generated using the EAM and the neutral voice model. Experimental results show that the proposed method performs emotional speech synthesis with reasonable emotions and high speech quality.
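
A toy numpy sketch of the additive/eigenvoice idea is given below: per-speaker neutral mean vectors and emotion offsets (EAMs) are stacked into a joint eigenspace, weights for a new speaker are estimated from its neutral part only (a least-squares stand-in for the maximum-likelihood estimation used in the paper), and the same weights reconstruct the speaker's EAM. All dimensions and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-speaker mean vectors: a neutral part and an emotion offset
# (EAM = emotional minus neutral), stacked into one supervector per speaker.
n_speakers, dim = 10, 40
neutral = rng.normal(size=(n_speakers, dim))
eam = rng.normal(scale=0.3, size=(n_speakers, dim))
stacked = np.hstack([neutral, eam])

# Eigenvoice basis over the joint [neutral ; EAM] space (PCA via SVD).
mean_voice = stacked.mean(axis=0)
_, _, components = np.linalg.svd(stacked - mean_voice, full_matrices=False)
basis = components[:5]                                   # keep 5 eigenvoices

# Estimate eigenvoice weights for a target speaker from neutral data only
# (plain least squares here; the paper uses a maximum-likelihood criterion).
target_neutral = rng.normal(size=dim)
weights, *_ = np.linalg.lstsq(basis[:, :dim].T,
                              target_neutral - mean_voice[:dim], rcond=None)

# Apply the same weights to the EAM half of the space to get the target's EAM,
# then add it to the neutral model to obtain the emotional model.
target_eam = mean_voice[dim:] + weights @ basis[:, dim:]
target_emotional = target_neutral + target_eam
```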

#6 Generalized variable parameter HMMs based acoustic-to-articulatory inversion

Authors: Xurong Xie ; Xunying Liu ; Lan Wang ; Rongfeng Su

Acoustic-to-articulatory inversion is useful for a range of related research areas including language learning, speech production, speech coding, speech recognition and speech synthesis. HMM-based generative modelling methods and DNN-based approaches have become dominant in recent years. In this paper, a novel acoustic-to-articulatory inversion technique based on generalized variable parameter HMMs (GVP-HMMs) is proposed. It leverages the strengths of both generative and neural network based modelling frameworks. On a Mandarin speech inversion task, a tandem GVP-HMM system using DNN bottleneck features as auxiliary inputs significantly outperformed the baseline HMM, multiple regression HMM (MR-HMM), DNN and deep mixture density network (MDN) systems by 0.20 mm, 0.16 mm, 0.12 mm and 0.10 mm respectively in terms of electromagnetic articulography (EMA) root mean square error (RMSE).

#7 Semi-supervised training of a voice conversion mapping function using a joint-autoencoder

Authors: Seyed Hamidreza Mohammadi ; Alexander Kain

Recently, researchers have begun to investigate Deep Neural Network (DNN) architectures as mapping functions in voice conversion systems. In this study, we propose a novel Stacked-Joint-Autoencoder (SJAE) architecture, which aims to find a common encoding of parallel source and target features. The SJAE is initialized from a Stacked-Autoencoder (SAE) that has been trained on a large general-purpose speech database. We also propose to train the SJAE using unrelated speakers that are similar to the source and target speaker, instead of using only the source and target speakers. The final DNN is constructed from the source-encoding part and the target-decoding part of the SJAE, and then fine-tuned using back-propagation. The use of this semi-supervised training approach allows us to use multiple frames during mapping, since we have previously learned the general structure of the acoustic space and also the general structure of similar source-target speaker mappings. We train two speaker conversions and compare several system configurations objectively and subjectively while varying the number of available training sentences. The results show that each of the individual contributions of SAE, SJAE, and using unrelated speakers to initialize the mapping function increases conversion performance.
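
To make the architecture concrete, here is a small PyTorch sketch (assumed layer sizes, a single hidden layer, and mean-squared losses) of a joint autoencoder over parallel source/target frames: the two codes are tied with an extra loss term, and the conversion network reuses the source encoder with the target decoder. This is a reading of the general idea, not the paper's stacked, pre-trained SJAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAutoencoder(nn.Module):
    def __init__(self, dim=40, code=24):
        super().__init__()
        self.enc_src = nn.Sequential(nn.Linear(dim, code), nn.Tanh())
        self.enc_tgt = nn.Sequential(nn.Linear(dim, code), nn.Tanh())
        self.dec_src = nn.Linear(code, dim)
        self.dec_tgt = nn.Linear(code, dim)

    def loss(self, x_src, x_tgt):
        z_src, z_tgt = self.enc_src(x_src), self.enc_tgt(x_tgt)
        return (F.mse_loss(self.dec_src(z_src), x_src)      # reconstruct source
                + F.mse_loss(self.dec_tgt(z_tgt), x_tgt)    # reconstruct target
                + F.mse_loss(z_src, z_tgt))                  # pull the two codes together

    def convert(self, x_src):
        return self.dec_tgt(self.enc_src(x_src))             # source encoder + target decoder

model = JointAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x_src, x_tgt = torch.randn(32, 40), torch.randn(32, 40)      # placeholder parallel frames
optimizer.zero_grad()
loss = model.loss(x_src, x_tgt)
loss.backward()
optimizer.step()
```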

#8 On glottal source shape parameter transformation using a novel deterministic and stochastic speech analysis and synthesis system

Authors: Stefan Huber ; Axel Roebel

In this paper we present a flexible deterministic plus stochastic model (DSM) approach for parametric speech analysis and synthesis with high quality. The novelty of the proposed speech processing system lies in its extended means to estimate the unvoiced stochastic component and to robustly handle the transformation of the glottal excitation source. It is therefore well suited as a speech system in the context of Voice Transformation and Voice Conversion. The system is evaluated in the context of a voice quality transformation on natural human speech. The voice quality of a speech phrase is altered by means of re-synthesizing the deterministic component with different pulse shapes of the glottal excitation source. A subjective listening test suggests that the speech processing system is able to successfully synthesize different voice quality characteristics and to evoke the corresponding perceptual sensation in listeners. Additionally, improvements of the speech synthesis quality compared to a baseline method are demonstrated.

#9 Fluent personalized speech synthesis with prosodic word-level spontaneous speech generation

Authors: Yi-Chin Huang ; Chung-Hsien Wu ; Ming-Ge Shie

This paper proposes an automatic approach to generating speech with fluency at the prosodic word level based on a small-sized speech database of the target speaker, consisting of read and fluent speech. First, an auto-segmentation algorithm is employed to automatically segment and label the database of the target speaker. A pre-trained average voice model is adapted to the voice model of the target speaker by using the auto-segmented data. For synthesizing fluent speech, a prosodic model is proposed to smooth the prosodic word-level parameters to improve the fluency within a prosodic word. Finally, a postfilter method based on the modulation spectrum is adopted to alleviate the over-smoothing problem of the synthesized speech and thus improve the speaker similarity. Experimental results showed that the proposed method can effectively improve the speech fluency and speaker similarity of the synthesized speech for a target speaker compared to the MLLR-based model adaptation method.

#10 Non-native speech synthesis preserving speaker individuality based on partial correction of prosodic and phonetic characteristics

Authors: Yuji Oshima ; Shinnosuke Takamichi ; Tomoki Toda ; Graham Neubig ; Sakriani Sakti ; Satoshi Nakamura

This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or HMM-based speech synthesis, which synthesizes foreign language speech of a specific non-native speaker reflecting the speaker-dependent acoustic characteristics extracted from the speaker's natural speech in his/her mother tongue, tends to cause a degradation of speaker individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to cross-lingual speech synthesis that preserves speaker individuality by explicitly using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded as the speech is directly affected by unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosodic correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results demonstrate that these proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech.

#11 Evaluation of state mapping based foreign accent conversion

Authors: Markus Toman ; Michael Pucher

We present an evaluation of the perception of foreign-accented natural and synthetic speech in comparison to accent-reduced synthetic speech. Our method for foreign accent conversion is based on mapping of Hidden Semi-Markov Model states between accented and non-accented voice models and does not need an average voice model of accented speech. We employ the method on recorded data of speakers with first language (L1) from different European countries and second language (L2) being Austrian German. Results from a subjective evaluation show that the proposed method is able to significantly reduce the perceived accent. It also retains speaker similarity when an average voice model of the same gender is used. Accentedness of synthetic speech was rated significantly lower than natural speech by the participants and listeners were unable to identify accents correctly for 81% of the natural and 85% of the synthesized samples. Our evaluation shows the feasibility of accent conversion with a limited amount of speech resources.

#12 Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features

Authors: Zhizheng Wu ; Simon King

Recently, Deep Neural Networks (DNNs) have shown promise as an acoustic model for statistical parametric speech synthesis. Their ability to learn complex mappings from linguistic features to acoustic features has significantly advanced the naturalness of synthesised speech. However, because DNN parameter estimation methods typically attempt to minimise the mean squared error of each individual frame in the training data, the dynamic and continuous nature of speech parameters is neglected. In this paper, we propose a training criterion that minimises speech parameter trajectory errors, and so takes dynamic constraints from a wide acoustic context into account during training. We combine this novel training criterion with our previously proposed stacked bottleneck features, which provide wide linguistic context. Both objective and subjective evaluation results confirm the effectiveness of the proposed training criterion for improving model accuracy and naturalness of synthesised speech.
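
The trajectory error is evaluated on the output of the standard maximum-likelihood parameter generation (MLPG) step rather than on per-frame predictions. Below is a minimal numpy sketch of that generation step for a single feature dimension with one delta window; the means, variances and window coefficients are placeholders, not the paper's configuration.

```python
import numpy as np

def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation for a 1-D feature stream.
    mu, var: (T, 2) arrays of [static, delta] means and variances.
    Returns the smooth static trajectory c = (W' S W)^-1 W' S mu_vec,
    where S is the diagonal inverse covariance. Purely illustrative."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                                  # static row
        for k, w in enumerate(delta_win):                  # delta row
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w
    mu_vec = mu.reshape(-1)
    prec = 1.0 / var.reshape(-1)
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu_vec)
    return np.linalg.solve(A, b)

# Placeholder means/variances for 5 frames.
mu = np.stack([np.linspace(0, 1, 5), np.zeros(5)], axis=1)
var = np.ones((5, 2))
trajectory = mlpg(mu, var)
```

Minimum trajectory error training then measures the error between this generated trajectory and the natural static parameters, and backpropagates it through the network that produced the means.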

#13 Combining extreme learning machine and decision tree for duration prediction in HMM based speech synthesis

Authors: Yang Wang ; Minghao Yang ; Zhengqi Wen ; Jianhua Tao

Hidden Markov Model (HMM) based speech synthesis using a Decision Tree (DT) for duration prediction is known to produce over-averaged rhythm. To alleviate this problem, this paper proposes a two-level duration prediction method together with outlier removal. This method takes advantage of the accurate regression capability of the Extreme Learning Machine (ELM) for phone-level duration prediction, and of the DT's ability to distribute state durations for state-level duration prediction. Experimental results showed that the method decreased the RMSE of phone duration, increased the fluctuation of syllable duration, and achieved a 63.75% score in a preference evaluation. Furthermore, this method does not require laborious manual alignment of the training corpus.
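
A sketch of the two-level idea, with made-up numbers: a phone-level duration (as would be predicted by a regressor such as an ELM) is distributed over the HMM states in proportion to decision-tree predicted state durations.

```python
import numpy as np

def distribute_phone_duration(phone_dur_frames, dt_state_means):
    """Split a phone-level duration over HMM states in proportion to
    decision-tree predicted state durations. Values are illustrative."""
    props = np.asarray(dt_state_means, dtype=float)
    props = props / props.sum()
    raw = props * phone_dur_frames
    states = np.floor(raw).astype(int)
    # Hand out the remaining frames to the states with the largest remainders.
    for i in np.argsort(raw - states)[::-1][: int(phone_dur_frames - states.sum())]:
        states[i] += 1
    return states

print(distribute_phone_duration(23, [4.0, 9.5, 3.0, 2.5, 6.0]))  # -> [4 9 3 2 5]
```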

#14 F0 parameterization of glottalized tones for HMM-based Vietnamese TTS

Authors: Duy Khanh Ninh ; Yoichi Yamashita

A conventional HMM-based TTS system for Hanoi Vietnamese often suffers from a hoarse quality due to the incomplete F0 parameterization of glottalized tones. As estimating F0 in glottalized regions is rather problematic for usual F0 extractors, we propose a pitch marking algorithm in which pitch marks are propagated from regular regions of the speech signal to glottalized ones, from which the complete F0 contour of a glottalized tone is derived. The proposed F0 parameterization scheme was confirmed to significantly reduce the hoarseness whilst improving the tone naturalness of synthetic speech in both objective and listening tests. The pitch marking algorithm works as a refinement step based on the results of an F0 extractor. Therefore, the proposed scheme can be combined with any F0 extractor.
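
A very small illustration of the propagation idea (simplified to a fixed local period and hypothetical sample positions): starting from the last reliable pitch mark before a glottalized region, marks are placed forward at the estimated period until the region ends, and an F0 value can then be read off the mark spacing.

```python
import numpy as np

def propagate_pitch_marks(last_regular_mark, period_samples, region_end):
    """Extend pitch marks from the last reliably detected mark through a
    glottalized region by stepping at the locally estimated period
    (real systems would adapt the period as they go)."""
    marks = []
    t = last_regular_mark + period_samples
    while t < region_end:
        marks.append(int(round(t)))
        t += period_samples
    return np.array(marks)

# Hypothetical: last mark at sample 8000, ~160-sample period (100 Hz at 16 kHz),
# glottalized region ending at sample 9000.
marks = propagate_pitch_marks(8000, 160.0, 9000)
f0_contour = 16000.0 / np.diff(np.concatenate([[8000], marks]))   # ~100 Hz throughout
```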

#15 Deep neural network context embeddings for model selection in rich-context HMM synthesis

Authors: Thomas Merritt ; Junichi Yamagishi ; Zhizheng Wu ; Oliver Watts ; Simon King

This paper introduces a novel form of parametric synthesis that uses context embeddings produced by the bottleneck layer of a deep neural network to guide the selection of models in a rich-context HMM-based synthesiser. Rich-context synthesis — in which Gaussian distributions estimated from single linguistic contexts seen in the training data are used for synthesis, rather than more conventional decision tree-tied models — was originally proposed to address over-smoothing due to averaging across contexts. Our previous investigations have confirmed experimentally that averaging across different contexts is indeed one of the largest factors contributing to the limited quality of statistical parametric speech synthesis. However, a possible weakness of the rich context approach as previously formulated is that a conventional tied model is still used to guide selection of Gaussians at synthesis time. Our proposed approach replaces this with context embeddings derived from a neural network.

#16 An investigation of context clustering for statistical speech synthesis with deep neural network

Authors: Bo Chen ; Zhehuai Chen ; Jiachen Xu ; Kai Yu

The state-of-the-art DNN speech synthesis system directly maps linguistic input to acoustic output, and voice quality improvement over the conventional MSD-GMM-HMM synthesis system has been reported. A DNN-based speech synthesis system does not require context clustering as in GMM-HMM systems, and this was believed to be the main advantage and contributor to the performance improvement. Our previous work has demonstrated that F0 interpolation, rather than context clustering, is the actual contributor to the performance improvement. However, it remains unknown whether the use of unclustered context is a beneficial characteristic of DNN-based synthesis or not. In this paper, this issue is investigated in detail. Decision tree clustered contexts are used as linguistic input for the DNN and compared to unclustered context input. A novel approach for inputting context clusters is proposed, in which the decision tree question indicators are used as input instead of the clustered contexts. Experiments showed that the DNN with clustered contexts significantly outperformed the DNN with unclustered contexts, and the proposed question indicator input approach obtained the best performance. The investigation in this paper reveals a limitation of DNN-based speech synthesis and implies that context clustering is also an important issue for DNN-based speech synthesis with limited training data.
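
A minimal sketch of the question-indicator input: every binary context question used by the decision trees becomes one dimension of the DNN input vector. The questions, the context dictionary and its fields below are invented for illustration only.

```python
# Each binary context question becomes one input dimension for the DNN.
QUESTIONS = {
    "C-Vowel":        lambda ctx: ctx["phone"] in "aeiou",
    "L-Silence":      lambda ctx: ctx["left"] == "sil",
    "Syl-Stressed":   lambda ctx: ctx["stress"] == 1,
    "Pos-in-Word<=2": lambda ctx: ctx["pos_in_word"] <= 2,
}

def question_indicator_vector(ctx):
    """Answer every decision-tree question for one context and return 0/1 indicators."""
    return [1.0 if question(ctx) else 0.0 for question in QUESTIONS.values()]

ctx = {"phone": "a", "left": "sil", "stress": 0, "pos_in_word": 1}
print(question_indicator_vector(ctx))   # [1.0, 1.0, 0.0, 1.0]
```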

#17 Sentence-level control vectors for deep neural network speech synthesis

Authors: Oliver Watts ; Zhizheng Wu ; Simon King

This paper describes the use of a low-dimensional vector representation of sentence acoustics to control the output of a feed-forward deep neural network text-to-speech system on a sentence-by-sentence basis. Vector representations for sentences in the training corpus are learned during network training along with other parameters of the model. Although the network is trained on a frame-by-frame basis, the standard frame-level inputs representing linguistic features are supplemented by features from a projection layer which outputs a learned representation of sentence-level acoustic characteristics. The projection layer contains dedicated parameters for each sentence in the training data which are optimised jointly with the standard network weights. Sentence-specific parameters are optimised on all frames of the relevant sentence — these parameters therefore allow the network to account for sentence-level variation in the data which is not predictable from the standard linguistic inputs. Results show that the global prosodic characteristics of synthetic speech can be controlled simply and robustly at run time by supplementing basic linguistic features with sentence-level control vectors which are novel but designed to be consistent with those observed in the training corpus.
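
The projection-layer idea can be sketched as a per-sentence embedding that is learned jointly with the frame-level network and concatenated onto the linguistic input of every frame of that sentence. The PyTorch sizes and layer shapes below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ControlVectorDNN(nn.Module):
    """Feed-forward acoustic model whose frame-level linguistic input is
    augmented with a learned per-sentence control vector (a sketch of the idea)."""
    def __init__(self, n_sentences, ling_dim=300, ctrl_dim=4, acoustic_dim=60):
        super().__init__()
        self.sentence_ctrl = nn.Embedding(n_sentences, ctrl_dim)  # one vector per training sentence
        self.net = nn.Sequential(
            nn.Linear(ling_dim + ctrl_dim, 512), nn.Tanh(),
            nn.Linear(512, acoustic_dim),
        )

    def forward(self, ling_feats, sentence_ids):
        ctrl = self.sentence_ctrl(sentence_ids)                   # (batch, ctrl_dim)
        return self.net(torch.cat([ling_feats, ctrl], dim=-1))

model = ControlVectorDNN(n_sentences=1000)
frames = torch.randn(8, 300)                                      # 8 frames of linguistic features
sent_ids = torch.full((8,), 42, dtype=torch.long)                 # all from training sentence 42
acoustic = model(frames, sent_ids)                                # (8, 60) acoustic features
```

At synthesis time a novel control vector, chosen to lie near those learned for the training sentences, can be supplied in place of the embedding lookup to steer the global prosody of the output.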

#18 Micro-structure of disfluencies: basics for conversational speech synthesis

Authors: Simon Betz ; Petra Wagner ; David Schlangen

Incremental dialogue systems can produce fast responses and can interact in a human-like fashion. However, these systems occasionally produce erroneous material or run out of things to say. Humans in such situations use disfluencies to remedy their ongoing production and signal this to the listener. We devised a new model for inserting disfluencies into synthesis and evaluated this approach in a perception test. The test showed that lengthenings and silent pauses can be built for speech synthesis with low effort and high output quality. Synthesized word fragments and filled pauses, while potentially useful in incremental dialogue systems, appear more difficult to handle for listeners. While we were able to get consistently high ratings for certain types of disfluencies, the need for more basic research on their micro-structure became apparent in order to be able to synthesize the fine phonetic detail of disfluencies. For this, we analysed corpus data with regard to distributional and durational aspects of lengthenings, word fragments and pauses. Based on these natural speaking strategies, we further explored to what extent speech can be delayed using disfluency strategies, and how to handle difficult disfluency elements by determining the appropriate amount of durational variation to apply.

#19 Using automatic stress extraction from audio for improved prosody modelling in speech synthesis

Authors: György Szaszák ; András Beke ; Gábor Olaszy ; Bálint Pál Tóth

Generating proper and natural sounding prosody is one of the key interests of today's speech synthesis research. An important factor in this effort is the availability of a precisely labelled speech corpus with adequate prosodic stress marking. Obtaining such a labelling constitutes a huge effort, whereas inter-annotator agreement scores usually remain far below 100%. Stress marking based on phonetic transcription is an alternative, but yields even poorer quality than human annotation. Applying automatic labelling may help overcome these difficulties. The current paper presents an automatic approach for stress detection based purely on audio, which is used to derive an automatic, layered labelling of stress events and link them to syllables. As a proof of concept, a speech corpus was extended with the output of the stress detection algorithm and an HMM-TTS system was trained on the extended corpus. Results are compared to a baseline system, trained on the same database, but with stress marking obtained from textual transcripts after applying a set of linguistic rules. The evaluation includes CMOS tests and the analysis of the decision trees. Results show an overall improvement in prosodic properties of the synthesized speech. Subjective ratings reveal a voice perceived as more natural.

#20 Reconstructing voices within the multiple-average-voice-model framework

Authors: Pierre Lanchantin ; Christophe Veaux ; Mark J. F. Gales ; Simon King ; Junichi Yamagishi

Personalisation of voice output communication aids (VOCAs) makes it possible to preserve the vocal identity of people suffering from speech disorders. This can be achieved by the adaptation of HMM-based speech synthesis systems using a small amount of adaptation data. When the voice has begun to deteriorate, reconstruction is still possible in the statistical domain by correcting the parameters of the models associated with the speech disorder. This can be done by substituting them with parameters from a donor's voice, at the risk of losing part of the patient's identity. Recently, the Multiple-Average-Voice-Model (Multiple AVM) framework has been proposed for speaker adaptation. Adaptation is performed via interpolation in a speaker eigenspace spanned by the mean vectors of speaker-adapted AVMs, which can be tuned to the individual speaker. In this paper, we present the benefits of this framework for voice reconstruction: it requires only a very small amount of adaptation data, interpolation can be performed in a clean speech eigenspace, and the resulting voice can be easily fine-tuned by acting on the interpolation weights. We illustrate our points with a subjective assessment of the reconstructed voice.

#21 HMM based Myanmar text to speech system

Authors: Ye Kyaw Thu ; Win Pa Pa ; Jinfu Ni ; Yoshinori Shiga ; Andrew Finch ; Chiori Hori ; Hisashi Kawai ; Eiichiro Sumita

This paper presents a complete statistical speech synthesizer for Myanmar, which includes a syllable segmenter, text normalizer, grapheme-to-phoneme converter, and an HMM-based speech synthesis engine. We believe this is the first such system for the Myanmar language. We performed a thorough human evaluation of the synthesizer relative to human and re-synthesized baselines. Our results show that our system is able to synthesize speech at a quality comparable to that of similar state-of-the-art synthesizers for other languages.

#22 Multiple feed-forward deep neural networks for statistical parametric speech synthesis

Authors: Shinji Takaki ; SangJin Kim ; Junichi Yamagishi ; JongJin Kim

In this paper, we investigate a combination of several feed-forward deep neural networks (DNNs) for a high-quality statistical parametric speech synthesis system. Recently, DNNs have significantly improved the performance of essential components of statistical parametric speech synthesis, e.g. spectral feature extraction, acoustic modeling and spectral post-filtering. Our proposed technique combines these feed-forward DNNs so that they can perform all standard steps of statistical speech synthesis end to end, including feature extraction from STRAIGHT spectral amplitudes, acoustic modeling, smooth trajectory generation and spectral post-filtering. The proposed DNN-based speech synthesis system is then compared to state-of-the-art speech synthesis systems, i.e. conventional HMM-based, DNN-based and unit-selection systems.

#23 Sequence-to-sequence neural net models for grapheme-to-phoneme conversion

Authors: Kaisheng Yao ; Geoffrey Zweig

Sequence-to-sequence translation methods based on generation with a side-conditioned language model have recently shown promising results in several tasks. In machine translation, models conditioned on source-side words have been used to produce target-language text, and in image captioning, models conditioned on images have been used to generate caption text. Past work with this approach has focused on large vocabulary tasks, and measured quality in terms of BLEU. In this paper, we explore the applicability of such models to the qualitatively different grapheme-to-phoneme task. Here, the input and output side vocabularies are small, plain n-gram models do well, and credit is only given when the output is exactly correct. We find that the simple side-conditioned generation approach is able to rival the state-of-the-art, and we are able to significantly advance the state-of-the-art with bi-directional long short-term memory (LSTM) neural networks that use the same alignment information that is used in conventional approaches.
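
For orientation, a compact PyTorch sketch of plain side-conditioned generation for G2P is shown below: one LSTM reads the grapheme sequence and a second LSTM generates phonemes from its final state under teacher forcing. Vocabulary sizes are invented, and this simple variant omits the bi-directional encoding and alignment information that the paper reports as giving the best results.

```python
import torch
import torch.nn as nn

class Seq2SeqG2P(nn.Module):
    """Tiny encoder-decoder sketch for grapheme-to-phoneme conversion."""
    def __init__(self, n_graphemes=30, n_phonemes=45, emb=64, hidden=128):
        super().__init__()
        self.g_emb = nn.Embedding(n_graphemes, emb)
        self.p_emb = nn.Embedding(n_phonemes, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        _, state = self.encoder(self.g_emb(graphemes))     # summarize the letter sequence
        dec_out, _ = self.decoder(self.p_emb(phonemes_in), state)
        return self.out(dec_out)                           # phoneme logits per output step

model = Seq2SeqG2P()
graphemes = torch.randint(0, 30, (4, 7))                   # batch of 4 words, 7 letters each
phonemes_in = torch.randint(0, 45, (4, 6))                 # shifted gold phonemes (teacher forcing)
logits = model(graphemes, phonemes_in)                     # (4, 6, 45)
```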

#24 Knowledge versus data in TTS: evaluation of a continuum of synthesis systems

Authors: Rosie Kay ; Oliver Watts ; Roberto Barra Chicote ; Cassie Mayo

Grapheme-based models have been proposed for both ASR and TTS as a way of circumventing the lack of expert-compiled pronunciation lexicons in under-resourced languages. It is a common observation that this should work well in languages employing orthographies with a transparent letter-to-phoneme relationship, such as Spanish. Our experience has shown, however, that there is still a significant difference in intelligibility between grapheme-based systems and conventional ones for this language. This paper explores the contribution of different levels of linguistic annotation to system intelligibility, and the trade-off between those levels and the quantity of data used for training. Ten systems spaced across these two continua of knowledge and data were subjectively evaluated for intelligibility.

#25 Improving G2P from Wiktionary and other (web) resources

Author: Steffen Eger

We consider the problem of integrating supplemental information strings in the grapheme-to-phoneme (G2P) conversion task. In particular, we investigate whether we can improve the performance of a G2P system by making it aware of corresponding transductions of an external knowledge source, such as transcriptions in other dialects or languages, transcriptions provided by other datasets, or transcriptions obtained from crowd-sourced knowledge bases such as Wiktionary. Our main methodological paradigm is that of multiple monotone many-to-many alignments of input strings, supplemental information strings, and desired transcriptions. Subsequently, we apply a discriminative sequential transducer to the multiply aligned data, using subsequences of the supplemental information strings as additional features.